The rapid development of internet and network technology has been followed by malicious threats
and attacks on networks and computers. Intrusion detection systems (IDS) were developed to address
these problems. Developing an IDS with machine learning is needed to classify the attacks. One
classification method is the Self-Organizing Map (SOM). SOM is able to perform classification and
visualization during the learning process to gain new knowledge. However, SOM learns less
efficiently when applied to Big Data. This study proposes the Restricted Growing SOM method with
clustering reference vector (RGSOM-CRV) and Parallel RGSOM-CRV to improve SOM efficiency in
classification, with accuracy taken into consideration, to solve the Big Data problem. The growing
process in RGSOM is restricted by a maximum number of nodes and a growing threshold, and the
reupdate weight process updates unused reference vectors when the map has already reached its
maximum size; these two processes address the time consumption of regular GSOM. In experiments on
the KDD Cup 1999 dataset, the proposed Parallel RGSOM-CRV achieves 91.86% accuracy, a 20.58% false
alarm rate, 95.32% recall (detection rate), and 94.35% precision, and it consumes less time than
regular Growing SOM. The proposed method is very promising for handling big data problems compared
with other methods.


1. INTRODUCTION

Security threats to internet usage and computer networks are increasing. New types of attacks on
networks appear periodically, which makes it a challenge to develop flexible and adaptive network
security. Techniques that detect anomaly-based network intrusions to protect computer systems and
networks from malicious activity are called Intrusion Detection Systems (IDS). They are needed
because suspicious network traffic and computer usage cannot be detected by conventional firewalls.

Some IDS developments are based on machine learning techniques. The methods used for anomaly-based
intrusion detection are commonly divided into classification and clustering, although hybrids of
clustering and classification have also been used for intrusion detection. Among classification
methods, some studies use a single classifier such as KNN [1], Support Vector Machine (SVM) [2], or
an artificial neural network [3-6] to solve the IDS problem. Other researchers use hybrids of a
heuristic algorithm with a classifier [7-9], Multi-level SVM and Extreme Learning Machine with
K-Means [10], Decision Tree and SVM [11], or Tree Augmented Naive Bayes (TAN) and Reduced Error
Pruning (REP) [12]. IDS-related studies using clustering include K-Means, K-Medoids, A-SPOT [13],
and CANN [14].

One method for classification and data reduction that can visualize the learning process is SOM. In
some studies SOM has solved classification problems with better results [15]. However, SOM is
inefficient when learning from large data (Big Data), where it is highly time consuming. The
problem stems from the way SOM works: in every epoch it calculates the distance between each input
vector and every reference vector to decide the winning neuron on the hidden layer. Li et al. [1]
appear to have faced this big data problem: they selected only 5,552 instances from the KDD Cup 99
sample data as training data and 5,552 instances as testing data for their experiment, rather than
using the entire KDD Cup 99 data.

Based on research conducted by Alahakoon [16], the Growing Self Organizing Map (GSOM) can be used
to build the reference vectors gradually from the training data. However, this solves the problem
only in the first epoch; in the following epochs the topological map becomes larger and the
time-consumption problem reappears. This study proposes the Restricted Growing Self Organizing Map
with clustering reference vector (RGSOM-CRV) and Parallel RGSOM-CRV to solve this issue of regular
GSOM and to handle big data problems. The research measures the time efficiency of RGSOM-CRV and
Parallel RGSOM-CRV together with their accuracy, false alarm rate, precision, and recall.


2. RESEARCH METHOD 

The research procedure is shown in Figure 1. Features are selected from the dataset according to
the chosen feature set, and each feature value is normalized before being processed by RGSOM-CRV.
The training data and the testing data are treated in the same way before being processed by
RGSOM-CRV.





Figure 1. Research procedure 


2.1. KDD Cup 99 Dataset 

The KDD Cup 99 dataset from UCI is split into a 10% training set (494,021 instances) and a test set
(4,898,431 instances), and it has 41 features. The dataset is categorized into four attack classes
(dos, probe, r2l, and u2r) and a normal category. This research uses all the data provided by this
dataset to measure the efficiency and effectiveness of the proposed method in handling big data
problems.


2.2. Selected Feature 

The training data are processed with the selected features. According to the KDD Cup 99 task
description (http://kdd.ics.uci.edu/databases/kddcup99/task.html), which is adapted from the paper
by Stolfo et al. [17], there are three types of features: basic features of individual TCP
connections, content features within a connection suggested by domain knowledge, and traffic
features computed using a two-second time window. In this research the basic features of individual
TCP connections are chosen. The features in this type are duration, protocol type, service, flag,
src bytes, dst bytes, land, wrong fragment, and urgent.
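
The feature selection described above can be sketched as follows. This is a minimal illustration, not the code used in this research: it assumes each record is one comma-separated line in which the nine basic features are the first nine fields and the class label is the last field (possibly followed by a dot). The function name and the underscore-style feature names are chosen here for illustration.

```python
# The nine basic TCP-connection features listed above, written with
# underscores (an assumption about naming, made for this sketch).
BASIC_FEATURES = [
    "duration", "protocol_type", "service", "flag", "src_bytes",
    "dst_bytes", "land", "wrong_fragment", "urgent",
]


def select_basic_features(record_line):
    """Keep the nine basic features and the label from one record line.

    Assumes the basic features are the first nine comma-separated fields
    and the label is the last field, optionally ending with a dot.
    """
    fields = record_line.strip().split(",")
    features = dict(zip(BASIC_FEATURES, fields[:len(BASIC_FEATURES)]))
    label = fields[-1].rstrip(".")
    return features, label
```

The values are returned as strings. How the categorical features (protocol type, service, flag) are turned into numbers before normalization is not stated in the text, so it is left out of the sketch.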


2.3. Normalize Feature 

After the dataset has been reduced to the basic feature type, each feature is normalized with (1)
before being processed by RGSOM-CRV:

$$n = \frac{x - \min(F_i)}{\max(F_i) - \min(F_i)} \tag{1}$$

where n is the normalized value, x is the real value, min(F_i) is the smallest value of the i-th
feature, and max(F_i) is the largest value of the i-th feature. After normalization the training
data are used to train RGSOM-CRV, which is then tested with the test data.
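
As a small worked example of (1), the sketch below scales one numeric feature column into the range [0, 1]. The function name is hypothetical, and the handling of a constant column (where the denominator of (1) would be zero) is a choice made here, not something specified in the text.

```python
def min_max_normalize(column):
    """Scale a list of numeric feature values into [0, 1] following (1)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: the denominator of (1) would be zero, so every
        # value is mapped to 0.0 (an assumption made for this sketch).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


print(min_max_normalize([0, 5, 10]))  # [0.0, 0.5, 1.0]
```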


2.4. GSOM

Growing SOM (GSOM) is a modification of the regular SOM proposed by Alahakoon [16]. According to
Alahakoon, the GSOM procedure consists of an initialization phase, a growing phase, and a smoothing
phase. Figure 2 shows new node generation from the boundary of the network. Figure 2a is the
initial map, which uses 4 nodes as reference vectors. Figure 2b shows that a high error occurs when
the distance between the input and the reference vector of the winner node is more than the growing
threshold. Figure 2c shows the growing schema of GSOM. The growing phase occurs when the distance
of the winner node is larger than the growing threshold.





Figure 2. New node generation from the boundary of the network 


2.5. RGSOM-CRV 

This research proposes RGSOM-CRV, which extends the regular SOM [15] and the regular GSOM [16]. The
difference from regular GSOM is that the growth of the reference vectors is restricted by a maximum
number of nodes (MN) and by the two dimensions of the grid map. The map grows if the distance
between the winning node and the input is more than the growing threshold (GT). If the number of
nodes in the map has reached MN, the map stops growing and instead reupdates one randomly selected
reference vector that has never been hit (has not yet been a winner) by any input. After the winner
node is selected, the weights of its neighborhood are updated, as in regular SOM. A reference
vector can be selected as the winner node more than once; such a vector is called a clustering
reference vector (CRV). A clustering reference vector represents a group of inputs whose weights
are similar according to the minimum GT. The CRV method reduces the size of the topographical map,
which decreases the time spent selecting the winner node.
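
The grow-or-reupdate rule described above can be sketched as a single decision function. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, a reupdated node is assumed to receive fresh random weights (the text does not say how its weights are set), and the placement of new nodes on the two-dimensional grid is omitted.

```python
import random


def grow_or_reupdate(nodes, used, x, winner_distance,
                     growing_threshold, max_nodes, rng=random):
    """Apply the restricted-growth rule for one input vector x.

    nodes: list of reference vectors (lists of floats).
    used:  set of indices of nodes that have already won at least once.
    Returns "grow", "reupdate", or "none".
    """
    if winner_distance <= growing_threshold:
        # The winner is close enough, so it absorbs x (clustering reference vector).
        return "none"
    if len(nodes) < max_nodes:
        # The map may still grow: add a node with random weights.
        nodes.append([rng.random() for _ in x])
        return "grow"
    unused = [i for i in range(len(nodes)) if i not in used]
    if unused:
        # The map is full: reinitialize one randomly chosen never-hit node
        # (random reinitialization is an assumption of this sketch).
        nodes[rng.choice(unused)] = [rng.random() for _ in x]
        return "reupdate"
    return "none"
```

Because the node list never exceeds `max_nodes`, the cost of searching for a winner stays bounded across epochs, which is the efficiency argument made for the restriction.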

Figure 3 is the flowchart of the RGSOM procedure, and Figure 3a shows the whole process. In the
initialization process the user sets the maximum map size and the maximum number of nodes for the
map, which restrict the growth of nodes in the map. The user also sets the start and stop learning
rates, the start and stop growing thresholds, and maxEpoch. The initialization process also
generates the nine initial reference vector nodes of the map, as shown in Figure 4a.


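$$S(t) = S_{start}\left(\frac{S_{end}}{S_{start}}\right)^{t/t_{end}} \tag{2}$$

$$win = \arg\min_k \{\lVert x - w_k \rVert\} \tag{3}$$

$$w_k(t+1) = w_k(t) + h_{win,k}\,\bigl(x(t) - w_k(t)\bigr) \tag{4}$$

$$h_{win,k} = \alpha(t)\exp\left(-\frac{\lVert r_{win} - r_k \rVert^2}{2\sigma^2(t)}\right) \tag{5}$$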


Figure 3b is the training process flowchart. The learning rate and the growing threshold are
updated with a monotonically decreasing function given by (2), where S(t) is the updated learning
rate or growing threshold, S_start is the starting value, S_end is the ending value, t is the
current epoch, and t_end is the maximum epoch. These updates set the learning rate and growing
threshold used in the current epoch. The winner node is calculated with the regular SOM procedure
given by (3), where win is the winner node, x is the input vector, and w_k is the k-th reference
vector. The reference vectors are the list of nodes generated initially (a square of nine nodes
positioned at the center of the map, Figure 4a) and by the growing function. If the winner distance
is more than the growing threshold of the current epoch and the number of generated nodes is
smaller than the maximum number of nodes, the reference vectors grow by a node with random weights.
If the number of generated nodes has reached the maximum, an unused reference vector is reupdated.
Unused reference vectors are nodes that have never been hit, that is, not yet selected as winner.
The growing scheme of RGSOM is shown in Figure 4c.

The neighborhood of the winning neuron is updated with (4), where h_{win,k} is the Gaussian
neighborhood function defined by (5). The learning rate is denoted α(t) and σ(t) is the
neighborhood radius; both are monotonically decreasing scalar functions of t that follow (2), and
r_k is the k-th neighbor of the winner node. The selected winner node is pushed to the list of used
nodes.

Figure 3c is the testing procedure flowchart. In testing, the winner is chosen from the map
generated by the training step, and only the used nodes serve as reference vectors for the testing
input. The winner node is used to calculate the True Negative (TN), True Positive (TP), False
Negative (FN), and False Positive (FP) values. A true negative is a correct prediction in which the
real category is normal and the instance is labeled as normal. A true positive is a correct
prediction in which the real category is an attack and the instance is labeled as an attack. A
false negative is an incorrect prediction in which the real category is an attack but the instance
is predicted as normal. A false positive is an incorrect prediction in which the real category is
normal but the instance is labeled as an attack.
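
The training quantities in (2)-(5) can be sketched in plain Python. This is a minimal sketch rather than the implementation used in this research. It assumes that (2) is the exponential decay S_start (S_end / S_start)^(t / t_end), that distances are Euclidean, and that the Gaussian in (5) has 2σ² in its denominator. The function names and the `positions` list of grid coordinates are hypothetical.

```python
import math


def decay(start, end, t, t_end):
    """(2), as assumed here: decrease monotonically from start to end."""
    return start * (end / start) ** (t / t_end)


def find_winner(x, weights):
    """(3): index of the reference vector closest to the input x."""
    return min(range(len(weights)), key=lambda k: math.dist(x, weights[k]))


def update_neighborhood(x, weights, positions, win, alpha, sigma):
    """(4)-(5): move every node toward x, scaled by a Gaussian of its
    grid distance to the winner."""
    for k, w in enumerate(weights):
        d2 = sum((a - b) ** 2 for a, b in zip(positions[win], positions[k]))
        h = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
        weights[k] = [wi + h * (xi - wi) for wi, xi in zip(w, x)]
```

With the settings reported in Section 3 (a learning rate from 0.9 to 0.1 over maxEpoch=4), `decay(0.9, 0.1, t, 4)` starts at 0.9 for t=0 and reaches 0.1 at t=4. The winner itself has grid distance zero, so it receives the full learning rate α(t).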


[Figure 3: flowcharts of (a) the whole process, (b) the training process, and (c) the testing
process. Recoverable step labels: Update Learning Rate; Update Growing Threshold; Reupdate Unused
Reference Vector; Generate New Node (Growing); Update Generated Nodes; Update Neighborhood; Update
Used Nodes; Choose winner node from used node; Input Category = winner category.]

Figure 3. Flowchart of the RGSOM-CRV procedure








Figure 4. Square growing node schema 


2.6. Measurement

This experiment uses four measurements to evaluate the method: accuracy (ACC), false alarm rate
(FAR), detection rate (DTR) or recall, and precision. Accuracy is the ratio of correctly classified
examples to the total number of examples, calculated with (6):

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{6}$$

The false alarm rate is the percentage of normal examples that have been labeled as attacks out of
the total number of normal examples, calculated with (7):

$$FAR = \frac{FP}{FP + TN} \times 100\% \tag{7}$$

The detection rate, or recall, is the percentage of attacks correctly labeled as attacks out of the
total number of attacks, calculated with (8):

$$DTR\ \text{or}\ Recall = \frac{TP}{TP + FN} \times 100\% \tag{8}$$

Precision is the percentage of attacks correctly labeled as attacks out of the total number of
instances labeled as attacks, calculated with (9):

$$Precision = \frac{TP}{TP + FP} \times 100\% \tag{9}$$
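
The four measurements follow directly from the confusion counts, as the short sketch below shows. The function name is hypothetical, and the counts in the usage line are made-up numbers chosen only to make the arithmetic easy to check; they are not results from this research.

```python
def ids_metrics(tp, tn, fp, fn):
    """Return ACC, FAR, DTR (recall) and precision as percentages."""
    return {
        "ACC": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "FAR": 100.0 * fp / (fp + tn),        # normal traffic flagged as attack
        "DTR": 100.0 * tp / (tp + fn),        # attacks that were detected
        "Precision": 100.0 * tp / (tp + fp),  # flagged instances that are attacks
    }


# Illustrative counts only: 100 attacks (90 detected), 100 normal (20 flagged).
m = ids_metrics(tp=90, tn=80, fp=20, fn=10)
print(m["ACC"], m["FAR"], m["DTR"])  # 85.0 20.0 90.0
```

FAR and DTR use different denominators (normal examples and attack examples respectively), so a method can have high recall and a high false alarm rate at the same time.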


3. RESULTS AND DISCUSSION 

In this experiment RGSOM-CRV and Parallel RGSOM-CRV are initialized with the same start learning
rate (LRstart), stop learning rate (LRstop), start growing threshold (GTstart), stop growing
threshold (GTstop), maximum epoch, and map size. The experiment uses LRstart=0.9, LRstop=0.1,
GTstart=0.05, GTstop=0.01, maxEpoch=4, and mapSize=100x100. The maximum number of nodes for
RGSOM-CRV is 5000, while for Parallel RGSOM-CRV it differs between protocols: udp has a maximum of
3000 nodes, and tcp and icmp have 5000 each. The setting differs because udp has far fewer training
instances than tcp and icmp, as shown in Table 1. The experiment was run on a computer with the
following specification: Intel Core i7-6500U CPU @2.5GHz 2.60 GHz processor, 8 GB memory, and a
64-bit operating system.
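
One way to read the per-protocol setup of Parallel RGSOM-CRV is that the training records are split by protocol type and each group is given its own map with its own node limit. The sketch below covers only that splitting step. The record layout `(protocol, feature_vector, label)` and the function name are assumptions made here, and how the per-protocol maps are actually scheduled is not described in the text.

```python
from collections import defaultdict

# Per-protocol node limits reported above for Parallel RGSOM-CRV.
MAX_NODES = {"tcp": 5000, "icmp": 5000, "udp": 3000}


def split_by_protocol(records):
    """Group (protocol, feature_vector, label) records by protocol type."""
    groups = defaultdict(list)
    for protocol, features, label in records:
        groups[protocol].append((features, label))
    return dict(groups)


# Tiny illustrative input, not taken from the dataset.
records = [
    ("tcp", [0.1, 0.0], "normal"),
    ("udp", [0.2, 0.4], "attack"),
    ("tcp", [0.3, 0.9], "attack"),
]
for protocol, group in split_by_protocol(records).items():
    print(protocol, len(group), MAX_NODES[protocol])
```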

Table 2 shows that regular GSOM consumed more than 2 days and ended in the second epoch because it
ran into a memory limit as the map grew bigger. In this experiment GSOM was not capable of handling
the large data because of its memory and time consumption, so it was not worth continuing.


Table 1. Distribution of Protocol Type 








Protocol Type    Count
icmp             283,602
tcp              190,065
udp              20,354





Table 2. GSOM Experiment 








           ACC   FAR   DTR/Recall   Precision   Training Time
epoch 1    -     -     -            -           02:32:11
epoch 2    -     -     -            -           More than 2 days and then stopped





Five experiments were evaluated for both methods. Table 3 shows the results of the five experiments
using RGSOM-CRV. The average RGSOM-CRV accuracy is 51.45%, the false alarm rate is 11.80%, the
recall or detection rate is 41.78%, and the precision is 93.09%. The average training time using
RGSOM-CRV is 5 hours 31 minutes and 32 seconds, and the average testing time is 1 hour 44 minutes
and 59 seconds.

The results of the five experiments using Parallel RGSOM-CRV are shown in Table 4. The average
Parallel RGSOM-CRV accuracy is 91.86%, the false alarm rate is 20.58%, the recall or detection rate
is 95.32%, and the precision is 94.35%. The average training time using Parallel RGSOM-CRV is 6
hours 33 minutes and 18 seconds (four epochs), while the average testing time is 46 minutes and 39
seconds.

From Table 3 and Table 4, Parallel RGSOM-CRV outperforms RGSOM-CRV in accuracy (91.86% against
51.45%), recall (95.32% against 41.78%), and precision (94.35% against 93.09%), although its
average false alarm rate is higher (20.58% against 11.80%). Precision is good in both sets of
experiments. In Table 3, the third experiment of RGSOM-CRV has much better accuracy than the other
experiments. This can happen because the RGSOM-CRV map is generated randomly in each experiment, so
different results may be obtained. RGSOM-CRV might also give different results for different
GTstart and GTstop settings. The results of the fifth experiment using Parallel RGSOM-CRV for each
protocol are shown in Table 5. In that experiment Parallel RGSOM-CRV has a total accuracy of
97.27%, a false alarm rate of 12.72%, a recall or detection rate of 99.79%, and a precision of
96.87%. The icmp protocol has the largest accuracy, false alarm rate, detection rate, and
precision.


Table 3. RGSOM-CRV Experiments Result 








               ACC      FAR      DTR/Recall   Precision   Training Time   Testing Time
Experiment 1   38.22%   18.08%   24.52%       81.21%      05:17:24        01:39:05
Experiment 2   38.62%   18.69%   27.94%       85.66%      05:25:42        01:32:56
Experiment 3   97.74%   9.71%    99.61%       97.60%      05:34:30        01:48:12
Experiment 4   41.12%   7.48%    28.25%       93.78%      05:43:03        01:27:26
Experiment 5   41.54%   3.87%    27.85%       96.64%      05:37:01        02:17:16
Average        51.45%   11.80%   41.78%       93.09%      05:31:32        01:44:59
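
The average training time in Table 3 can be reproduced with a few lines of arithmetic. The helper below is a hypothetical illustration that converts each HH:MM:SS value to seconds, takes the mean, and truncates to whole seconds.

```python
def average_hms(times):
    """Average a list of 'HH:MM:SS' strings, truncated to whole seconds."""
    def to_seconds(t):
        h, m, s = (int(part) for part in t.split(":"))
        return 3600 * h + 60 * m + s

    mean = sum(to_seconds(t) for t in times) // len(times)
    return f"{mean // 3600:02d}:{mean % 3600 // 60:02d}:{mean % 60:02d}"


# Training times of the five RGSOM-CRV experiments in Table 3.
training = ["05:17:24", "05:25:42", "05:34:30", "05:43:03", "05:37:01"]
print(average_hms(training))  # 05:31:32
```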





Table 4. Parallel RGSOM-CRV Experiments Result 








               ACC      FAR      DTR/Recall   Precision   Training Time   Testing Time
Experiment 1   97.54%   11.02%   99.72%       97.26%      08:32:11        00:52:48
Experiment 2   94.79%   19.93%   99.48%       94.01%      06:15:59        00:42:02
Experiment 3   78.51%   17.52%   77.20%       93.34%      05:51:22        00:57:07
Experiment 4   91.21%   42.68%   99.69%       90.32%      06:18:46        00:47:38
Experiment 5   97.27%   12.72%   99.79%       96.87%      05:48:16        00:33:42
Average        91.86%   20.58%   95.32%       94.35%      06:33:18        00:46:39





Table 5. Parallel RGSOM-CRV Result for Each Protocol at Experiment 5








Protocol   ACC      FAR      DTR/Recall   Precision
icmp       99.74%   56.80%   100.00%      99.74%
tcp        93.39%   14.97%   99.44%       90.17%
udp        98.48%   0.52%    33.41%       49.60%
Total      97.27%   12.72%   99.79%       96.87%





For more detailed insight we can study the map generated in each epoch, shown in Figures 5-7. The
visualizations in Figures 5-7 provide new knowledge about the training process of Parallel RGSOM.
Figure 5 shows that, for the udp protocol, the randomly scattered nodes are nodes generated by the
reupdate unused reference vector process. Randomly scattered nodes also appear in Figure 7 for the
tcp and udp protocols. For the icmp protocol, the number of used reference vectors decreases from
the first to the fourth epoch, which means that many training inputs have similar weights.

Parallel RGSOM-CRV outperforms regular GSOM in time efficiency: it spends on average 6 hours 33
minutes and 18 seconds on training and 46 minutes and 39 seconds on testing. The training time of
RGSOM-CRV is better than that of Parallel RGSOM-CRV because Parallel RGSOM-CRV includes a procedure
that selects inputs according to protocol type. However, the testing time of Parallel RGSOM-CRV is
better than that of RGSOM-CRV because Parallel RGSOM-CRV generates fewer used nodes in the map, so
scanning for the winner node is more efficient. The problem of regular GSOM in classifying big data
has been addressed by RGSOM-CRV and Parallel RGSOM-CRV: restricting the number of nodes generated
under the growing threshold keeps the map from growing bigger and bigger. The clustering reference
vector makes RGSOM-CRV capable of generalizing weights based on the growing threshold.

From Table 6, Parallel RGSOM-CRV has lower accuracy than the other methods, but the amount of
testing data should be considered: the proposed method is tested on 4,898,431 instances and
processes fewer features. Even with this much larger amount of test data, Parallel RGSOM-CRV
produces 91.86% accuracy, so the method is very promising for solving big data problems in
classification.


Table 6. Comparison of the proposed method with other methods by number of training data, testing
data, features, and accuracy








Method                           Training Data   Testing Data   Features       ACC
KNN [1]                          5,552           5,552          15             97.69%
SVM+ELM based on K-Means [10]    494,021         311,029        41             95.75%
TAN+REP [12]                     326,053         167,968        Not provided   98.99%
Parallel RGSOM-CRV               494,021         4,898,431      9              91.86%

















[Figure 7 panels: icmp, tcp, udp]

Figure 7. Map generated with Parallel RGSOM-CRV for the third epoch of experiment 5


4. CONCLUSION 

In this experiment RGSOM-CRV and Parallel RGSOM-CRV outperform regular GSOM in time consumption, so
the proposed method is more efficient than the GSOM method. In comparison with other methods, the
result of Parallel RGSOM-CRV is acceptable, with 91.86% accuracy, a false alarm rate of around
20.58%, a recall or detection rate of 95.32%, and 94.35% precision. This study also concludes that
finding the best maximum number of nodes will increase efficiency, and that the generalization
capability of RGSOM depends on GTstart and GTstop. The capability to generalize reference vectors
makes the accuracy and detection rate acceptable. Finding the optimum settings of the growing
thresholds is left as a reference for future research.